Comparison of two propensity score-based methods for balancing covariates: the overlap weighting and fine stratification methods in real-world claims data

Background: Two propensity score (PS) based methods for balancing covariates, the overlap weighting method (OW) and the fine stratification method (FS), produce superb covariate balance. OW has been compared with various weighting methods, while FS has been compared with the traditional stratification method and various matching methods. However, no study has yet compared OW and FS. In addition, OW has not yet been evaluated in large claims data with low prevalence exposure and with low frequency outcomes, a context in which optimal use of balancing methods is critical. In this study, we aimed to compare OW and FS using real-world data and simulations with low prevalence exposure and with low frequency outcomes.

Methods: We used the Texas State Medicaid claims data on adult beneficiaries with diabetes in 2012 as an empirical example (N = 42,628). Based on its real-world research question, we estimated an average treatment effect of health center vs. non-health center attendance in the total population. We also performed simulations to evaluate the relative performance of the two methods. To preserve associations between covariates, we used the plasmode approach to simulate outcomes and/or exposures with N = 4,000. We simulated both homogeneous and heterogeneous treatment effects with various outcome risks (1-30% or observed: 27.75%) and/or exposure prevalence (2.5-30% or observed: 10.55%). We used a weighted generalized linear model to estimate the exposure effect and the cluster-robust standard error (SE) method to estimate its SE.

Results: In the empirical example, we found that OW had smaller standardized mean differences in all covariates (range: OW: 0.0–0.02 vs. FS: 0.22–3.26) and a smaller Mahalanobis balance distance (MB) (< 0.001 vs. > 0.049) than FS. In simulations, OW also achieved smaller MB (homogeneity: < 0.04 vs. > 0.04; heterogeneity: 0.0-0.11 vs. 0.07–0.29), smaller relative bias (homogeneity: 4.04–56.20 vs. 20–61.63; heterogeneity: 7.85–57.6 vs. 15.0-60.4), smaller square root of mean squared error (homogeneity: 0.332–1.308 vs. 0.385–1.365; heterogeneity: 0.263-0.526 vs. 0.313-0.620), and higher coverage probability (homogeneity: 0.0–80.4% vs. 0.0-69.8%; heterogeneity: 0.0-97.6% vs. 0.0-92.8%) than FS, in most cases.

Conclusions: These findings suggest that OW can yield nearly perfect covariate balance and therefore enhance the accuracy of average treatment effect estimation in the total population.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12874-024-02228-z.


Background
Because running a randomized experiment is often infeasible, observational data are often used to estimate the population health effects of interventions. When estimating plausibly causal effects using observational data, it is necessary to reduce imbalance in the empirical distribution of the pretreatment confounders between the treated and control groups [1]. Lowering imbalance can reduce the degree of model dependence for the statistical estimation of causal effects [1–4], and thus reduce inefficiency and bias [1]. To achieve balanced covariates, propensity scores (PS) have become a cornerstone in observational studies aimed at estimating causal effects [5,6]. The PS is defined as the predicted probability of receiving a particular treatment (or exposure) given the covariate realizations of a study subject.
In this paper, we study PS-based approaches to estimate the average treatment effect in the total population (ATE). There are three common types of balancing methods via PS: matching, stratifying, and weighting. Among matching methods, the PS matching method (PSM) is the most commonly used in practice [1]. It is simple and intuitive because it reduces the multidimensional covariate space to one dimension. Despite its widespread adoption, it requires a large sample size because it discards subjects who are not matched. In addition, PSM has been shown to increase "imbalance, inefficiency, model dependence, and bias," which is not the case with most other matching methods [1]. Among the stratification methods, the most common one is to stratify subjects into five quintiles of PS. With the stratum boundaries determined by the PS distribution in the exposed and the comparison groups combined, it eliminates approximately 90% of bias due to measured confounding [7]. However, when exposure is infrequent, it may result in all exposed subjects being aggregated in one or more extreme strata [5,8]. The fine stratification weights method (FS), a recent method, can solve this issue by increasing the number of strata and by determining stratum boundaries based on the PS distribution in the exposed group only. It has been shown to gain greater efficiency than the traditional one [8]. Among the weighting methods, inverse probability weighting (IPW) is popular but performs poorly when some subjects have extreme PS [9–11]. The PS-based overlap weighting method (OW), another recent method, overcomes IPW's extreme weight issue and produces impressive covariate balance [12,13].
OW has been theoretically proven to have a small-sample exact balance property [12]. That is, it leads to exact balance on the mean of every covariate when the PS is estimated by a logistic regression. It was also less sensitive to model misspecification than IPW in a simulation study [14]. Despite these features, to our knowledge, OW has only been evaluated by comparing it with weighting methods such as IPW and trimmed IPW [9–14]. Little is known about the relative performance of OW compared with other types of balancing methods, including matching and stratification methods [15]. In addition, OW has not been evaluated in large claims data with low prevalence exposure and/or with low frequency events (i.e., outcomes), a context in which optimal use of balancing methods is critical.
Furthermore, matching on PS is limited by the exclusion of subjects without a suitable match, leading to a non-representative population and a loss of statistical power [16]. PSM variants, including 1:1, 1:5, and full matching, had lower model precision than FS in at least two claims studies [8,17]. Therefore, we aimed to compare OW with FS only, both relatively new and promising methods, using real-world and simulated claims data in settings with infrequent exposure and/or with low prevalence outcomes.

Empirical example
We used a cohort of 42,628 Texas State Medicaid beneficiaries, aged 18-64, diagnosed with type 2 diabetes, who had at least one primary care visit between January 2012 and December 2012. About 10.55% (n = 4,498) of the patients received the majority of their primary care at federally qualified health centers (FQHCs) (exposure), while the rest (89.45%, n = 38,130) received care at non-FQHCs (control). Researchers analyzed whether patients who had routine primary care at FQHCs had fewer hospitalizations and emergency room visits than the non-FQHC patients. Five continuous and 12 binary covariates were selected based on clinical relevance and previous literature. The exposure rate in the empirical example, 10.55%, is close to rare (an exposure below 10% is typically considered rare), and hospitalization is often a rare outcome.
The study was reviewed by the University of Chicago Institutional Review Board and determined to be nonhuman subject research.

Overlap weighting method (OW)
The OW method mimics a randomized trial by assigning appropriate weights to generate a clinically relevant target population: the population that overlaps between groups. That is, a subject in the treatment group receives a weight that is the probability of not receiving the treatment (i.e., 1 - PS), while a subject in the control group receives a weight that is the probability of receiving the treatment (i.e., PS). As a consequence, the two groups have overlapped PS distributions. Subjects in the region of the PS distribution where the two groups overlap receive more weight, while those in a non-overlapping tail of the PS distribution receive less. In addition, OW does not prune any subjects. The target of inference, advantages, and disadvantages of the OW and FS methods are compared in eTable 1.
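The weight assignment described above can be sketched in a few lines of Python. This is only an illustration: the function name `overlap_weights` and the example propensity scores are hypothetical and are not taken from the study's SAS code.

```python
def overlap_weights(ps, treated):
    """Overlap weights: 1 - PS for treated subjects, PS for control subjects."""
    if len(ps) != len(treated):
        raise ValueError("ps and treated must have the same length")
    return [1.0 - p if t == 1 else p for p, t in zip(ps, treated)]


# Hypothetical propensity scores and treatment indicators (1 = treated).
ps = [0.05, 0.30, 0.50, 0.70, 0.95]
treated = [0, 0, 1, 1, 1]
weights = overlap_weights(ps, treated)

for p, t, w in zip(ps, treated, weights):
    print(f"PS={p:.2f} treated={t} weight={w:.2f}")
```

In this toy input, the subject with PS = 0.50 receives the largest weight, while the control with PS = 0.05 and the treated subject with PS = 0.95 receive the smallest, which matches the down-weighting of the tails described in the text.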

PS-based fine stratification method (FS)
The FS method proposed by Desai et al. (2017) [8] finds matched balancing scores (PS) via stratification with a large number of PS strata (much larger than the five used in the traditional stratification method), and then assigns appropriate weights to subjects per stratum. It minimizes the loss of exposed subjects, which is especially relevant when treatment exposure is rare, because losing subjects decreases the precision of the treatment effect estimates [8]. The method only excludes subjects whose PS are not in the overlapped PS regions between the two groups. There are two steps for implementation: (1) create equally-sized PS strata by ranking only treated/exposed subjects based on PS values and then assign control/unexposed subjects to these strata; (2) following stratification, in all strata with at least one treated patient and one control patient, weights are calculated (see below).
Regarding the optimum number of strata, Desai et al. stated that it may be difficult to make general recommendations because it may depend on the prevalence of a rare-exposed treatment [8]. The number of PS strata they used was 10, 50, or 100, and all produced similar bias and precision in their simulations. In this study, we chose 20 PS strata, with an average stratum width of about 0.05, smaller than the recommended PS width of 0.2 [7]. Each stratum had about 225 subjects from the FQHC-exposed group in our empirical example.
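The first implementation step can be illustrated with a short Python sketch. It is a simplification under stated assumptions: stratum boundaries are taken as PS quantiles among exposed subjects, subjects whose PS falls outside the exposed PS range are marked as excluded (`None`), and the function name `fine_strata` and the toy data are hypothetical. It also assumes at least as many exposed subjects as strata.

```python
from bisect import bisect_left


def fine_strata(ps, exposed, n_strata):
    """Assign subjects to PS strata whose boundaries come from exposed subjects only.

    Returns one stratum index per subject, or None when the PS lies outside
    the range of PS values observed among exposed subjects.
    """
    exp_ps = sorted(p for p, e in zip(ps, exposed) if e == 1)
    n_exp = len(exp_ps)
    # Upper boundary of stratum k: largest PS among the first (k+1)/n_strata
    # share of the ranked exposed subjects, so strata hold equal exposed counts.
    uppers = [exp_ps[((k + 1) * n_exp) // n_strata - 1] for k in range(n_strata)]
    lowest, highest = exp_ps[0], exp_ps[-1]
    strata = []
    for p in ps:
        if p < lowest or p > highest:
            strata.append(None)  # outside the exposed PS range: excluded
        else:
            strata.append(bisect_left(uppers, p))
    return strata


# Toy data: four exposed subjects followed by five unexposed subjects.
ps = [0.2, 0.4, 0.6, 0.8, 0.1, 0.3, 0.5, 0.7, 0.9]
exposed = [1, 1, 1, 1, 0, 0, 0, 0, 0]
strata = fine_strata(ps, exposed, n_strata=2)
print(strata)
```

In a full implementation, strata that end up without at least one exposed and one unexposed subject would also be dropped before weights are computed, as described in step (2).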

Target of inference (estimand) and weights
In this study, since each patient can switch their primary care visits between FQHCs and non-FQHCs, we estimated the ATE among all patients [6]. In the literature, there are two existing approaches to assign weights for the ATE. Let $N_i$ be the total number of subjects in stratum $i$, $N_{\mathrm{exp},i}$ and $N_{\mathrm{unexp},i}$ the numbers of exposed and unexposed subjects in stratum $i$, $N$ the total sample size, and $N_{\mathrm{exp}}$ and $N_{\mathrm{unexp}}$ the total numbers of exposed and unexposed subjects. One approach, denoted 'ATE-equ', generates equal total weights between groups: the weight is $N_i / N_{\mathrm{exp},i}$ for the exposed group and $N_i / N_{\mathrm{unexp},i}$ for the unexposed group [18,19]. The other approach, denoted 'ATE-unequ', uses $(N_i / N) / (N_{\mathrm{exp},i} / N_{\mathrm{exp}})$ for the exposed group and $(N_i / N) / (N_{\mathrm{unexp},i} / N_{\mathrm{unexp}})$ for the unexposed group [6,20]. This alternative approach results in the total weight in one group being equal to the sample size in that group. The two weighting methods are very similar, except that ATE-unequ additionally includes the group size $N_{\mathrm{exp}}$ ($N_{\mathrm{unexp}}$) in the weight for the exposed (unexposed) group.
As a weighting method, OW targets the overlap population, and its corresponding estimand is referred to as the ATE on the overlap population (ATO) [12]. Zhou et al. (2020) stated that OW was part of a class of balancing weights that target a judiciously chosen subpopulation of interest from which an estimand is closely related to ATE [14]. Not surprisingly, OW's total weights are identical between groups, the same as ATE-equ.
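To make the two stratum-based ATE weighting schemes concrete, the following Python sketch computes both from stratum membership and exposure status. The function name `fs_ate_weights`, its `equal_totals` flag, and the toy data are hypothetical; the sketch assumes that excluded subjects have already been removed.

```python
from collections import Counter


def fs_ate_weights(strata, exposed, equal_totals=True):
    """Stratum-based ATE weights.

    equal_totals=True  -> 'ATE-equ':   (stratum size) / (group size in stratum)
    equal_totals=False -> 'ATE-unequ': the same ratio times (group size) / (sample size)
    """
    n_total = len(strata)
    n_group = Counter(exposed)             # subjects per exposure group
    n_stratum = Counter(strata)            # subjects per stratum
    n_cell = Counter(zip(strata, exposed)) # subjects per stratum-by-group cell
    weights = []
    for s, e in zip(strata, exposed):
        w = n_stratum[s] / n_cell[(s, e)]
        if not equal_totals:
            w *= n_group[e] / n_total
        weights.append(w)
    return weights


# Toy data: stratum 0 has 1 exposed and 2 unexposed; stratum 1 has 2 and 2.
strata = [0, 0, 0, 1, 1, 1, 1]
exposed = [1, 0, 0, 1, 1, 0, 0]
w_equ = fs_ate_weights(strata, exposed, equal_totals=True)
w_unequ = fs_ate_weights(strata, exposed, equal_totals=False)


def group_total(weights, group):
    return sum(w for w, e in zip(weights, exposed) if e == group)


print(group_total(w_equ, 1), group_total(w_equ, 0))
print(group_total(w_unequ, 1), group_total(w_unequ, 0))
```

In this toy input, the ATE-equ weights sum to the full sample size (7) in each group, whereas the ATE-unequ weights sum to each group's own size (3 exposed, 4 unexposed), in line with the description in the text.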

Evaluation of performance via the empirical example
In the empirical study, we evaluated the methods [21] using the standardized mean difference between the two groups (SMD), Mahalanobis balance (MB), and the final retained sample size. SMD is a distance measure of balance for each covariate [22]. MB is a metric that measures the distance between the two group mean vectors of all covariates and is standardized by the sample covariance matrix [1,17,23]. Final sample size [8] is a measure of model precision and can be important for a rare event outcome.
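As an illustration of the per-covariate balance metric, a weighted SMD could be computed as in the Python sketch below. The pooled-variance form used here (difference in weighted means divided by the square root of the average of the two group variances, with variances normalized by total weight) is a common convention and is an assumption on our part; the exact formula used in the study is the one in reference [22]. The function names are hypothetical.

```python
from math import sqrt


def weighted_mean_var(x, w):
    """Weighted mean and weighted variance (normalized by the total weight)."""
    total = sum(w)
    mean = sum(wi * xi for xi, wi in zip(x, w)) / total
    var = sum(wi * (xi - mean) ** 2 for xi, wi in zip(x, w)) / total
    return mean, var


def smd(x, treated, weights=None):
    """Standardized mean difference of one covariate between the two groups."""
    if weights is None:
        weights = [1.0] * len(x)  # unweighted (crude) comparison
    stats = {}
    for group in (1, 0):
        xs = [xi for xi, t in zip(x, treated) if t == group]
        ws = [wi for wi, t in zip(weights, treated) if t == group]
        stats[group] = weighted_mean_var(xs, ws)
    (m1, v1), (m0, v0) = stats[1], stats[0]
    return (m1 - m0) / sqrt((v1 + v0) / 2)


# Toy covariate: treated values {1, 2}, control values {3, 4}.
crude_smd = smd([1, 2, 3, 4], [1, 1, 0, 0])
print(crude_smd)
```

MB extends the same idea to all covariates at once by replacing the scalar variance with the sample covariance matrix, so that correlations between covariates are taken into account.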

Simulations
After balancing covariates, we determined the relative performance of OW and FS for model bias and precision. The degree of covariate imbalance is proportional to bias in the treatment effect [24], and final sample size is associated with precision. However, because the true FQHC effect in the empirical example is unknown, we do not know the real size of model bias and precision, especially in a setting of infrequent exposure/outcome. Therefore, we conducted simulations.
Instead of using ordinary simulation approaches that do not capture important features that may exist between covariates, we chose the plasmode approach to conduct simulations [25–27]. Through resampling with replacement from all the observed covariates, plasmode can preserve the associations between covariates with potentially complex covariance structures, which are common in healthcare claims databases [25]. Details on how we simulated an outcome and/or an exposure factor via logistic regression models are provided in Appendix A (including R code). There were two logit models, for outcome and exposure, respectively, with two different linear combinations of covariates. We simulated two types of treatment effects: homogeneous and heterogeneous. To simulate a heterogeneous treatment effect, we replaced the constant treatment effect with an interaction term between exposure and sex (or age): sex served as an example of a binary heterogeneity factor, and age as a continuous one [25]. Age was standardized before simulating a heterogeneous treatment effect. The simulation settings can be found in eTable 2.
To examine settings with infrequent outcome and/or occasional exposure, for each type of treatment effect, we simulated four scenarios by varying outcome risks and/or exposure prevalence. Scenarios simulated outcome risks of 1%, 10%, and 30% with the observed exposure prevalence (10.55%) or with 2.5% simulated exposure. We also simulated exposure prevalence of 2.5%, 10%, and 30% with the observed outcome risk (27.75%) or with 1% simulated outcome risk. We set the true FQHC effect to one, as the coefficient of both the homogeneous and heterogeneous treatment terms. For each scenario, we simulated 500 datasets, each with a sample size of 4,000. For each simulated dataset, a weighted generalized linear model (GLM) with the log link function in the SAS GENMOD procedure was used to estimate the FQHC effect, i.e., the natural logarithm of the relative risk [8]. Because our GLMs included nonuniform weights, instead of using the default delta method, we used the cluster-robust standard error method to estimate the standard error (SE) of the effect [28,29]. After covariates are balanced, adjusting further for covariates is unnecessary because they are unrelated to the treatment variable [30]. That is, a simple difference in means on the balanced data can estimate the causal effect.
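The plasmode idea described above can be sketched as follows. This Python sketch is not the authors' R code from the appendix: the function name `plasmode_sample`, the coefficient vectors, and the toy covariate rows are all hypothetical, and the intercepts are not calibrated to a target exposure prevalence or outcome risk.

```python
import math
import random


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def plasmode_sample(covariates, n, beta_exposure, beta_outcome, effect, seed=0):
    """Resample observed covariate rows and simulate exposure and outcome.

    covariates: list of observed covariate tuples (one per subject).
    beta_*:     [intercept, coefficient per covariate] for each logit model.
    effect:     homogeneous treatment coefficient in the outcome model.
    """
    rng = random.Random(seed)
    simulated = []
    for _ in range(n):
        # Drawing a whole row with replacement keeps covariate associations intact.
        row = rng.choice(covariates)
        lin_exposure = beta_exposure[0] + sum(
            b * x for b, x in zip(beta_exposure[1:], row))
        a = 1 if rng.random() < expit(lin_exposure) else 0
        lin_outcome = beta_outcome[0] + effect * a + sum(
            b * x for b, x in zip(beta_outcome[1:], row))
        y = 1 if rng.random() < expit(lin_outcome) else 0
        simulated.append((row, a, y))
    return simulated


# Hypothetical observed covariates: (binary sex, standardized age).
covariates = [(0, 0.5), (1, -1.2), (1, 0.3), (0, 1.1)]
sims = plasmode_sample(covariates, 50, [-2.0, 0.5, 0.3], [-1.0, 0.4, 0.2],
                       effect=1.0, seed=42)
print(sum(a for _, a, _ in sims), "exposed out of", len(sims))
```

A heterogeneous effect could be simulated in the same framework by replacing `effect * a` with `effect * a * row[k]` for the chosen modifier column `k` (for example, sex or standardized age).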

Evaluation of performance in simulations
In the simulation study, we used the following criteria to evaluate the methods: mean MB, mean relative bias (rbias), standard deviation (SD) of rbias, square root of mean squared error of bias (rMSE), average SE of the estimated effect, average final sample size, two coverages [8,17,23], and significance. Relative bias is the percent relative difference, 100(estimated effect - truth)/truth [12–14]. The rMSE combines squared bias (not rbias) and its variance. Coverage is the probability that the 95% confidence interval (CI) covers the true effect (denoted as 'coverage') [12,13,31]. It can be obtained in two steps: (1) compute a CI via our weighted GLM, and (2) calculate the proportion of samples whose CI covers the true effect among 500 simulations. In our simulation study, a CI could cover both the non-zero true effect and zero, which may influence statistical significance. Therefore, to distinguish from the traditional coverage, we generated another one ('coverageT') counting those CIs that cover the true effect but not zero. Because in some cases CIs were too narrow to cover the true effect (see results below), we also report significance, defined as the proportion of samples obtaining a significant effect (by a weighted GLM with a two-sided p-value < 0.05). The two coverages and significance are associated with model precision [32], but are more targeted to detecting the true treatment effect. Among the criteria, the least useful is SE because it measures the variability of the effect in a model, not bias, precision, or covariate balance.
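The criteria above can be computed from the per-simulation estimates and SEs, as in the following Python sketch. Two assumptions are ours rather than the paper's: CIs are Wald intervals (estimate ± 1.96 × SE), and a result is counted as significant when that interval excludes zero, which corresponds to a two-sided Wald p-value < 0.05. The function name `summarize` and the toy inputs are hypothetical.

```python
from math import sqrt


def summarize(estimates, ses, truth, z=1.96):
    """Summarize simulation results: rbias, rMSE, coverage, coverageT, significance."""
    n = len(estimates)
    rbias = [100 * (b - truth) / truth for b in estimates]
    rmse = sqrt(sum((b - truth) ** 2 for b in estimates) / n)
    coverage = coverage_t = significance = 0
    for b, se in zip(estimates, ses):
        low, high = b - z * se, b + z * se
        covers_truth = low <= truth <= high
        covers_zero = low <= 0 <= high
        coverage += covers_truth                       # traditional coverage
        coverage_t += covers_truth and not covers_zero # covers truth but not zero
        significance += not covers_zero                # Wald test rejects zero
    return {
        "mean_rbias": sum(rbias) / n,
        "rmse": rmse,
        "coverage": coverage / n,
        "coverageT": coverage_t / n,
        "significance": significance / n,
    }


# Toy results from three simulated datasets with a true effect of 1.
result = summarize([1.0, 0.5, 1.6], [0.2, 0.4, 0.2], truth=1.0)
print(result)
```

In the toy input, the second interval covers both the truth and zero, so it counts toward coverage but not toward coverageT or significance; the third interval excludes both, so it counts only toward significance.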

Unmatched subjects
Although matching was not involved in the study, we used simulations to examine whether pruning clearly unmatched subjects has any effect on model bias and precision. The unmatched subjects are those who are present in one group but not in the other in terms of combination cells of binary covariates.
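One way to flag such subjects is sketched below in Python: each distinct combination of binary covariate values defines a cell, and a subject is flagged when its cell contains members of only one group. The function name `unmatched_flags` and the toy rows are hypothetical.

```python
def unmatched_flags(binary_rows, treated):
    """Flag subjects whose binary-covariate combination occurs in only one group."""
    groups_in_cell = {}
    for row, t in zip(binary_rows, treated):
        groups_in_cell.setdefault(tuple(row), set()).add(t)
    return [groups_in_cell[tuple(row)] != {0, 1} for row in binary_rows]


# Toy data with two binary covariates per subject.
rows = [(0, 1), (0, 1), (1, 1), (1, 0)]
treated = [1, 0, 1, 0]
flags = unmatched_flags(rows, treated)
print(flags)
```

Removing the flagged subjects from the full dataset would give the reduced dataset used for comparison.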

Summary of all methods
In summary, we used two datasets for performance evaluation: one was the original full dataset (denoted as 'F'); the other was the dataset (denoted as 'X') after deleting the unmatched subjects. We also evaluated the two weighting approaches. Therefore, there were a total of seven methods for comparison: crude, OW F, OW X, FS F−equ, FS X−equ, FS F−unequ, and FS X−unequ (a summary can be found in eTable 3). We used SAS version 9.4 to conduct covariate balancing and statistical modeling for both empirical and simulation studies (Appendix A for analysis of one simulation), and we used an R function (Appendix B) to generate simulation datasets in R version 4.3.0.

Analysis of the empirical dataset
Table 1 shows the evaluation of the seven methods using the real-world data. OW F and OW X were nearly identical and performed the best, reducing the SMD of all covariates to zero and achieving the smallest MB over all covariates, indicating perfect balancing of covariates. The four FS methods consistently performed fairly well over all covariates, with all SMD around zero. Among them, the two FS with equal weights between groups (FS F−equ and FS X−equ) were closer to each other and achieved better MB than the other two FS with unequal weights (FS F−unequ and FS X−unequ). The crude method exhibited the worst performance, with far more imbalance.
For the final sample size used for further analysis, OW X excluded 147 subjects (0.345%) who had no matches for combinations of all binary variables. Using the full dataset, FS excluded 21 subjects (< 0.05%) whose PS were in non-overlapped regions. Distributions of PS per group are shown in eFigure 1.

Analysis of simulated datasets with the homogeneous treatment effect
Tables 2 and 3 show the simulation evaluation of each method by risk level of outcome and by exposure prevalence, respectively. In most scenarios, OW F and OW X had very similar results and both performed better than the other methods. That is, both OW methods had small MB, rbias, SD of rbias, and rMSE, relatively small SE, and relatively large coverage and coverageT. There were two exceptions. One was that the crude method had the smallest SE of the estimate. The reason is that model estimation by the crude method is consistent with the simulation method (i.e., two logit models with a constant and additive treatment effect). The second exception was the cases with rare outcome events (1%) and low exposures (2.5% and 10%) (Table 2), where the crude method had the smallest rbias, SD of rbias, and rMSE compared to the others. The reason is that both rare events and low exposures resulted in complete separation or quasi-complete separation of data points, which caused model estimation to be unstable [33,34]. After removing these simulated samples, both OW methods had smaller rbias and larger coverage than the others (eTable 4).
Similar to the empirical study, the four FS methods were quite close to each other. The two FS with equal weights had smaller MB than the two FS with unequal weights. However, each pair of FS using data with the same sample size (either full or reduced datasets) had almost the same model estimates. These results indicate that the two ATE weighting methods differed slightly in balance values but were almost identical in model estimation. The two FS using the full datasets generally had better model estimates than the two using the reduced datasets.
The criteria showed different patterns of change over simulations. As outcome risk (Table 2) or exposure prevalence (Table 3) increased, the power increased; SE, SD, rMSE, and both coverages decreased; and significance increased. Coverages decreased as outcome risk (or exposure prevalence) increased because a smaller SE resulted in a CI of the effect that was too narrow to cover the true effect. MB, unrelated to model estimation, remained stable within a method when outcome risk increased, and reduced greatly when exposure prevalence increased. On the other hand, rbias increased when outcome risk increased, and remained similar within a method when exposure increased.

Footnotes:
1. Crude = summarized by raw data without any balancing method; OW = overlap weighting method; FS = propensity score based fine stratification method
2. 'F' = a full set of data; 'X' = a subset of data after removing those unmatched
3. 'equ' = ATE with the equal weighting between groups; 'unequ' = ATE with the unequal weighting, where total weight in one group equivalent to the sample size in that group
4. The best values are bolded and can be used to guide which method performs the best per evaluation criterion
5. MB = Mahalanobis balance; rBias = relative bias = 100*(estimated effect - true effect)/true effect; SE = average estimated standard error; SD(rBias) = empirical standard deviation of relative bias x 100; rMSE = square root of mean squared error that combines squared bias (not relative bias) and its variance; Coverage = proportion of samples whose 95% CI cover the true effect; CoverageT = proportion of samples whose 95% CI cover the true effect but not zero; Significance = proportion of samples obtaining a significant effect (by a weighted GLM with a two-sided p-value < 0.05); N used = average total sample size that was used further for GLM.
Footnote: * The simulation scenario with 1% outcome risk and 2.5% exposure prevalence is not shown here because it has been shown in Table 1.

To determine whether the simulation results were due to real differences or Monte Carlo error (MCE), we calculated MCE for both MB and rbias (eTables 5-6). We evaluated the number of simulations needed (Appendix B) and found that 500 simulations were enough for most settings of outcome and exposure.
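As a rough illustration of this kind of check, the Monte Carlo error of a simulation mean is commonly estimated as the empirical SD across simulations divided by the square root of the number of simulations, and the same relation can be inverted to obtain the number of simulations needed for a target MCE. The Python sketch below uses that standard formula; whether it matches the exact calculation in the appendix is an assumption, and the function names are hypothetical.

```python
import math
from statistics import stdev


def monte_carlo_error(values):
    """MCE of the mean of a performance measure across simulations."""
    return stdev(values) / math.sqrt(len(values))


def simulations_needed(sd, target_mce):
    """Smallest number of simulations giving an MCE of at most target_mce."""
    return math.ceil((sd / target_mce) ** 2)


# Toy per-simulation values of a performance measure (e.g., rbias).
mce = monte_carlo_error([1.0, 2.0, 3.0, 4.0, 5.0])
n_needed = simulations_needed(sd=10.0, target_mce=0.5)
print(mce, n_needed)
```

Under this formula, halving the target MCE requires four times as many simulated datasets.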

Analysis of simulated datasets with heterogeneous treatment effect
Tables 4 and 5 and eTables 7 and 8 show the evaluation results for sex(age)-dependent treatment effects. Similar to the results with the constant treatment effect, the two OW methods had very similar results and both performed better than the other methods in terms of MB, rbias, rMSE, coverages, and significance. As in the homogeneous cases, there were two exceptions. One was that the crude method had the smallest SE of the estimate. The other was that FS had smaller rMSE than OW in the case with 1% outcome and 10% or 10.55% exposure. This was also due to complete separation or quasi-complete separation of data points in a few simulated samples. After removing those samples, the OW methods still performed the best (eTable 9). All change patterns across scenarios were consistent with those in the homogeneous cases.

Discussion
Both OW and FS methods performed well among PS-based balancing methods for causal inference. To our knowledge, our study is the first to compare OW and FS as representatives of two types of PS-based balancing methods: weighting and stratification. We used real-world and simulated claims data to assess their relative performance. We included simulations of rare outcome and/or exposure, which are not uncommon in claims-based observational studies. We simulated data for both homogeneous and heterogeneous treatment effects. The OW method obtained nearly perfect covariate balance and performed much better in covariate balance, model bias, model precision, and coverages than FS. The target of inference (estimand) we focused on was the ATE, due to the nature of the intervention in the real-world example, where it was feasible to treat all eligible patients. The target of inference by OW is a special ATE, called ATO. OW is part of a class of balancing weights that target a judiciously chosen subpopulation from which an estimand is closely related to ATE [14]. OW produces equal total weights between groups by its definition, i.e., making the two groups overlapped in terms of PS values. For the FS method, we evaluated the two published weighting algorithms for ATE estimation: with and without equal total weights between groups. We found that ATE-equ performed better than ATE-unequ in terms of covariate balance (SMD and MB), but both algorithms had almost identical model estimation in terms of bias and precision. In its formula, compared with ATE-equ, ATE-unequ includes the group sample size in the numerator of the weight for a subject in that group. This additional piece was designed to normalize and stabilize weights by limiting unduly large weights [20]. However, it made the total weights unequal between groups, which reduced covariate balance slightly but did not affect model bias and precision.
We assume that our study met all the key assumptions for causal inference, including the stable unit treatment value assumption, the consistency assumption, and the positivity assumption [14,35]. However, practical violations of the positivity assumption occur when some subjects almost always (or almost never) receive treatment [14], for example, those unmatched in combinations of binary covariates. Our study explored whether removing those unmatched subjects helped covariate balance and model estimation. This was a minor matter in our case, perhaps because the proportion of those removed was very low, about 0.34% of the whole population. Using the reduced data compared to using the full data, the simulation results showed slightly smaller covariate imbalance, but slightly larger model bias and imprecision. That is, although covariate balance is slightly reduced by retaining clearly unmatched subjects, the larger sample size kept model estimation less biased and more precise, especially with infrequent outcome and/or exposure. In addition, FS further removed some subjects with extreme PS, because their PS were not in the overlapped PS region between groups. However, comparing FS with OW, which did not remove any subjects, we confidently state that, given balanced PS values between groups, including mismatched subjects does not affect model estimation in settings with infrequent outcome/exposure.
In our weighted GLM analysis, we used the cluster-robust method to estimate the SE of the intervention effect. It is inaccurate to use the delta method, the default model-based method, when using matching weights, because it assumes weights are frequency weights rather than probability weights [28].
In simulation results for both homogeneous and heterogeneous scenarios, we observed that as outcome risk level increased, bias increased. Higher risks and stronger correlations among exposure, outcome, and covariates led to larger bias in effect estimation [36]. That is, higher confounding, which we did not adjust for in analysis, caused more bias. Among the 17 covariates, more than half were confounders, i.e., associated with hospitalization rate. As outcome risk increased, these confounders had a stronger confounding effect, which resulted in larger bias. Adjusting for those confounders could have improved model precision and accuracy. However, we purposely did not adjust further for them in the modeling stage because in the real-world example, investigators did not know which covariates were real confounders.
In simulation results, we also observed that as exposure prevalence increased, MB values in the crude method decreased. One possible reason is that higher exposure prevalence and stronger correlations between covariates and exposure resulted in better covariate balance. Furthermore, we found that as the rate of outcome and/or exposure increased, coverage decreased and even reached zero. That is, with greater power, the CIs became too narrow to cover the true effect. The corresponding 100% significance rates are consistent with this explanation.
The choice of our performance criteria was based on guidance on metrics for covariate balance [21]. The MB criterion, which accounts for pairwise correlations between covariates, provides insights beyond SMD. This is the first study to use MB to evaluate OW. In some simulation settings, the coverage probability could reach 100% because it is different from the confidence level [32]. Our study also addressed the issue of misleading results when coverage probability is used as a criterion [14] by reporting two coverages and one significance measure in place of the traditional single coverage.
Footnotes:
* The simulation scenario with 1% outcome risk and 2.5% exposure prevalence was not conducted because the combination of rare event and rare exposure resulted in complete or quasi-complete separation of data points (shown in Table 2).
1. Crude = summarized by raw data without any balancing method; OW = overlap weighting method; FS = propensity score based fine stratification method.
2. 'F' = a full set of data; 'X' = a subset of data after removing those unmatched.
3. 'equ' = ATE with equal weighting between groups; 'unequ' = ATE with unequal weighting, where the total weight in one group is equivalent to the sample size in that group.
4. The best values are bolded and can be used to guide which method performs best per evaluation criterion.
5. MB = Mahalanobis balance; rBias = relative bias, calculated as 100*(estimated effect - true effect)/true effect; SE = average estimated standard error; SD(rBias) = empirical standard deviation of relative bias x 100; rMSE = square root of mean squared error, which combines squared bias (not relative bias) and its variance; Coverage = proportion of samples whose 95% CI covers the true effect; CoverageT = proportion of samples whose 95% CI covers the true effect but not zero; Significance = proportion of samples obtaining a significant effect (by a weighted GLM with a two-sided p-value < 0.05); N used = average total sample size that was used further for GLM.
6. The true sex-dependent treatment effect was 63.59%, calculated as the observed female proportion (63.59%) times the true effect (= 1).
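The footnote definitions of rBias, rMSE, Coverage, CoverageT, and Significance can be sketched in a few lines of code. This is only an illustration, not the authors' implementation: the function name `summarize_simulation` is hypothetical, the 95% CI is assumed to be a Wald interval (estimate ± 1.96 × SE), and significance is assumed to be equivalent to that interval excluding zero.

```python
import math

def summarize_simulation(estimates, std_errors, true_effect, z=1.96):
    """Summarize per-sample estimates against a known true effect.

    Assumptions (not from the paper): Wald 95% CIs and a significance
    test that is equivalent to the CI excluding zero.
    """
    n = len(estimates)
    mean_est = sum(estimates) / n
    # Relative bias in percent: 100 * (estimated effect - true effect) / true effect
    rbias = 100.0 * (mean_est - true_effect) / true_effect
    # Root mean squared error; the mean squared error equals bias^2 + variance
    rmse = math.sqrt(sum((b - true_effect) ** 2 for b in estimates) / n)

    cover = cover_t = signif = 0
    for b, se in zip(estimates, std_errors):
        lo, hi = b - z * se, b + z * se
        covers_true = lo <= true_effect <= hi
        excludes_zero = not (lo <= 0.0 <= hi)
        cover += covers_true                      # Coverage
        cover_t += covers_true and excludes_zero  # CoverageT
        signif += excludes_zero                   # Significance
    return {
        "rBias": rbias,
        "rMSE": rmse,
        "Coverage": cover / n,
        "CoverageT": cover_t / n,
        "Significance": signif / n,
    }
```

Reporting Coverage alongside CoverageT and Significance separates intervals that cover the true effect only because they are wide enough to include zero from intervals that cover it while also detecting an effect.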
Besides OW, the FS method performed relatively well compared with the crude method. The FQHC and non-FQHC groups had substantially overlapping PS distributions, and fewer than 0.05% of subjects were removed because of non-overlapping PS between them. As in Desai et al.'s study evaluating FS [8], after balancing covariates, the PS distributions became perfectly overlapped in the empirical example. This indicates that the number of strata, 20, was sufficient.
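To make the stratification step concrete, the sketch below assigns subjects to PS strata and derives ATE weights. It rests on several assumptions that may differ from the authors' implementation: strata are equal-sized quantile bins of the PS in the full cohort, strata lacking either group are dropped, and each weight is the stratum's share of the retained cohort divided by its share of the subject's own group. The function name `fine_stratification_weights` is hypothetical. Under this scheme the total weight in each group equals that group's retained sample size, which is how the footnotes describe the 'unequ' variant.

```python
def fine_stratification_weights(ps, treated, n_strata=20):
    """Return one ATE weight per subject, or None if the stratum is dropped.

    ps       -- estimated propensity scores
    treated  -- 0/1 exposure indicators (integers)
    n_strata -- number of quantile-based strata (20 in the text above)
    """
    n = len(ps)
    # Rank subjects by PS and cut the ranking into roughly equal-sized strata.
    order = sorted(range(n), key=lambda i: ps[i])
    stratum = [0] * n
    for rank, i in enumerate(order):
        stratum[i] = rank * n_strata // n

    # Count comparators (index 0) and exposed (index 1) in each stratum.
    counts = {}
    for s, t in zip(stratum, treated):
        counts.setdefault(s, [0, 0])[t] += 1

    # Keep only strata that contain both groups (assumed handling of non-overlap).
    kept = {s for s, c in counts.items() if c[0] > 0 and c[1] > 0}
    n_kept = sum(sum(counts[s]) for s in kept)
    n_arm = [sum(counts[s][a] for s in kept) for a in (0, 1)]

    weights = []
    for s, t in zip(stratum, treated):
        if s not in kept:
            weights.append(None)
            continue
        stratum_share = sum(counts[s]) / n_kept   # stratum's share of the cohort
        arm_share = counts[s][t] / n_arm[t]       # stratum's share of the subject's group
        weights.append(stratum_share / arm_share)
    return weights
```

With many strata and a rare exposure, some strata can contain no exposed subjects; the `None` entries make that loss of sample explicit.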
Our simulation results for constant treatment effect are consistent with Ripollone et al.'s study, which also used simulated claims data [17]. In the simulated outcome with a risk level of 20%, 20% exposure prevalence, and a sample size of 25,000, their FS analysis had 0.054 MB, 0.07-0.08 bias, and 0.178-0.172 rMSE, while ours had 0.047 MB, 0.0183 bias, and 0.025 rMSE, given the sample size of > 40,000 (eTable 10).
In our study, the two study groups were quite similar in that their PS distributions overlapped substantially. However, the advantages of OW are greatest when comparator groups are very different [37]. This is because the OW method places more weight on regions where the PS distributions overlap and less weight on the tails. In the same situation, the FS method removes more subjects from non-overlapping regions, which results in more severe bias and probably lower model precision because of the reduced sample size.
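The weighting behaviour described above can be illustrated with the standard overlap weight, which is 1 - PS for an exposed subject and PS for a comparator. The sketch below is an illustration with made-up propensity scores, not the study's code; the function names are hypothetical, and inverse probability weights are included only for contrast.

```python
def overlap_weights(ps, treated):
    """Overlap weight: 1 - e for exposed subjects, e for comparators."""
    return [1.0 - e if t == 1 else e for e, t in zip(ps, treated)]

def ipw_weights(ps, treated):
    """Inverse probability weights for the ATE, shown only for contrast."""
    return [1.0 / e if t == 1 else 1.0 / (1.0 - e) for e, t in zip(ps, treated)]

# Illustrative (made-up) propensity scores: three exposed, three comparators.
ps = [0.02, 0.50, 0.98, 0.02, 0.50, 0.98]
treated = [1, 1, 1, 0, 0, 0]

ow = overlap_weights(ps, treated)
ipw = ipw_weights(ps, treated)

for e, t, w_ow, w_ipw in zip(ps, treated, ow, ipw):
    print(f"PS={e:.2f} exposed={t} OW={w_ow:.2f} IPW={w_ipw:.2f}")
```

Overlap weights stay between 0 and 1, so no subject is discarded and no single subject can dominate the estimate. An exposed subject with a PS near 1, or a comparator with a PS near 0, has few counterparts in the other group and receives a weight near zero. Inverse probability weights, by contrast, grow without bound as the PS approaches 0 or 1.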
Our study has some limitations. First, the OW method can be used to estimate the ATE only on the overlap population (ATO), not the average treatment effect on the treated population (ATT). However, two studies showed that when the exposure prevalence is small, ATO approximates ATT [12,35]. Second, because we simulated rare outcomes (1%) and exposures (2.5-10.55%), some simulated samples faced complete or quasi-complete separation of data points, which made model estimation unstable. More advanced modeling methods, such as Firth's method [34] and Bayesian methods [33], could be used. However, this is beyond the goal of the study. Third, our simulation findings may not be generalizable because our simulations were based on one empirical study. However, both OW and FS have been separately evaluated in multiple studies. Fourth, our simulations did not consider misspecification of the PS model or varying degrees of overlap of the PS distributions. However, Zhou et al. [14] conducted simulations for such situations to compare the performance of OW, IPW, and other weighting methods. They found that OW was robust in these situations. One possible reason they pointed out was that the estimand of OW is defined on the true rather than the estimated PS, and that OW smoothly down-weights the influence of observations at both ends of the PS spectrum [14]. Last, to estimate the PS, we used logistic regression, that is, a logit modeled as a linear combination of covariates. To capture complex dependency patterns between outcome and covariates, a machine learning method such as random forest may provide a more accurate and less model-dependent estimate of the PS [38]. This will be our future work.
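The first limitation notes that ATO approximates ATT when exposure prevalence is small. One way to see why is through the tilting functions that define each target population. The notation below is not from the original text and is introduced only for illustration: e(x) is the propensity score, τ(x) the conditional treatment effect, h(x) the tilting function, and ε an assumed upper bound on e(x).

```latex
% Weighted average treatment effect for a generic tilting function h
\tau_h = \frac{\mathbb{E}\left[h(X)\,\tau(X)\right]}{\mathbb{E}\left[h(X)\right]},
\qquad
h_{\mathrm{ATE}}(x) = 1,\quad
h_{\mathrm{ATT}}(x) = e(x),\quad
h_{\mathrm{ATO}}(x) = e(x)\,\{1 - e(x)\}.

% If e(x) \le \varepsilon for all x, then
(1-\varepsilon)\,e(x) \;\le\; e(x)\,\{1 - e(x)\} \;\le\; e(x),

% so the ratio h_ATO(x) / h_ATT(x) lies in [1 - \varepsilon, 1],
% and \tau_{\mathrm{ATO}} approaches \tau_{\mathrm{ATT}} as \varepsilon \to 0.
```

When every subject has a low probability of exposure, the factor 1 - e(x) is close to 1, so the overlap population is nearly the same as the treated population.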

Conclusion
As demonstrated by our analysis with real-world and extensive simulated claims data, the OW method can yield nearly perfect covariate balance while retaining the entire sample. Therefore, OW can enhance the accuracy of ATE estimation over FS in most cases.
Balancing covariates between treatment and control groups in observational studies can be challenging, especially in settings with infrequent outcomes and exposures. Both OW and FS can effectively balance covariates. These two PS-based methods have been separately evaluated against other methods [8–14, 17] but have never been compared with each other. We found that OW generally led to better covariate balance and model precision. However, in settings with extremely rare outcomes (≤ 1%) and exposures (≤ 10%), OW performed slightly worse than FS in at least one evaluation criterion. Future studies should analyze scenarios with rare outcomes and exposures in more detail. In conclusion, OW could be considered an effective and easy-to-implement method for balancing covariates for ATE estimation in settings with infrequent but not too rare outcomes and exposures.

Abbreviations
Abbreviation: Meaning
ATE: Average treatment effect in the total population
ATE-equ: The ATE method to generate equal total weights between groups
ATE-unequ: The ATE method to generate unequal total weights between groups
ATO: Average treatment effect on the overlap population
ATT: Average treatment effect on the treated population
CI: 95% confidence interval
Coverage: Probability of the 95% confidence interval that covers the true effect, ignoring whether zero was covered or not
CoverageT: Probability of the 95% confidence interval (CI) that covers the true effect, but not zero
Crude: No balancing method used
FQHC: Federally qualified health centers
FS: Propensity score-based fine stratification method
FS F−equ: The FS method with a full set of data and subjects' weights assigned by ATE-equ

Table 1
Evaluation of OW and FS in the empirical example
Footnotes:
1. Crude = summarized by raw data without any balancing method; OW = overlap weighting method; FS = propensity score based fine stratification method.
2. 'F' = a full set of data; 'X' = a subset of data after removing those unmatched.
3. 'equ' = ATE with equal weighting between groups; 'unequ' = ATE with unequal weighting, where the total weight in one group is equivalent to the sample size in that group.
4. MB = Mahalanobis balance.
5. N used = the total sample size that was used further for GLM analysis.
6. TANF = temporary assistance for needy families.

Table 2
Evaluation of OW and FS methods by simulation with the constant true effect by outcome risk along with observed/simulated exposure

Table 3
Evaluation of OW and FS methods by simulation with the constant true effect and by exposure prevalence with observed outcome risk or with simulated outcome risk of 1%

Table 4
Evaluation of OW and FS methods by simulations with sex-dependent heterogeneous treatment effect by outcome risk along with observed/simulated exposure.*

Table 5
Evaluation of OW and FS methods by simulation with the sex-dependent heterogeneous treatment effect by exposure prevalence with observed outcome risk or with simulated outcome risk of 1%